---
layout: ../layouts/Layout.astro
title: "LiveOCRVQA: A Dynamic Benchmark for Stylized Text Understanding"
description: "LiveOCRVQA is a benchmark for evaluating the abilities of vision-language models on stylized text reading."
favicon: favicon.svg
thumbnail: screenshot-light.png
---

import Header from "../components/Header.astro";
import Video from "../components/Video.astro";
import HighlightedSection from "../components/HighlightedSection.astro";
import SmallCaps from "../components/SmallCaps.astro";
import Figure from "../components/Figure.astro";
import Image from "../components/Image.astro";
import TwoColumns from "../components/TwoColumns.astro";
import YouTubeVideo from "../components/YouTubeVideo.astro";
import LaTeX from "../components/LaTeX.astro";

import { ImageComparison } from "../components/ImageComparison.tsx";

import outside from "../assets/outside.mp4";
import transformer from "../assets/transformer.webp";
import Splat from "../components/Splat.tsx"
import dogsDiffc from "../assets/dogs-diffc.png"
import dogsTrue from "../assets/dogs-true.png"
import benchmark from "../assets/benchmark.png"
import bias from "../assets/bias.png"

import CodeBlock from "../components/CodeBlock.astro";
import Table from "../components/Table.astro";
export const components = {pre: CodeBlock, table: Table}

<Header
  title={frontmatter.title}
  authors={[
    {
      name: "Yesheng Liu",
      notes: ["*", "†", "††"],
    },
    {
      name: "Zheqi He",
      institution: "",
      notes: ["*"],
    },
    {
      name: "Jingshu Zheng",
      notes: ["*"],
    },
    {
      name: "Jinge Yao",
      notes: ["*"],
    },
    {
      name: "Xi Yang",
      notes: ["*"],
    },
    {
      name: "Jiajun Zhang",
      notes: ["†", "††"],
    },
  ]}
  notes={[
    {
      symbol: "*",
      text: "BAAI",
    },
    {
      symbol: "†",
      text: "Institute of Automation, Chinese Academy of Sciences",
    },
    {
      symbol: "††",
      text: "School of Artificial Intelligence, University of Chinese Academy of Sciences ",
    },
  ]}
  links={[
    {
      name: "Paper",
      url: "",
      icon: "ri:file-pdf-2-line",
    },
    {
      name: "Code",
      url: "https://github.com/flageval-baai/FlagEvalMM/tree/liveocrvqa/tasks/liveocrvqa",
      icon: "ri:github-line",
    },
    {
      name: "arXiv",
      url: "",
      icon: "academicons:arxiv",
    },
    {
      name: "Dataset",
      url: "https://huggingface.co/datasets/BAAI/LiveOCRVQA/",
      icon: "simple-icons:huggingface",
      iconColor: "#FFD21E",
    }
  ]}
  />


<HighlightedSection>

## Introduction

We introduce LiveOCRVQA, a dynamic benchmark designed to avoid contamination from models’ exposure to static, web-sourced data. Using a semi-automated pipeline, we continuously curate fresh images of stylized text—legible to humans but challenging for LMMs—and pair them with metadata across diverse sources. Our initial release comprises 385 instances, which we used to evaluate 21 leading models. Results reveal that many LMMs often fall back on memorized text or learned font/layout patterns rather than true character recognition. Both the dataset and evaluation suite are publicly available, and we will update them quarterly to track real-world text-understanding progress.

</HighlightedSection>

## LiveOCRVQA

LiveOCRVQA comprises 385 stylized text-image instances across four categories—Books, Movies, Albums, and Games—with carefully curated question–answer pairs to evaluate text-understanding capabilities. Comparing to existing benchmarks, LiveOCRVQA is continuously updated to avoid contamination from models’ exposure to static, web-sourced data. Evaluation shows that well-differentiated score rather than saturated score is more meaningful for evaluating the ability of LMMs to understand stylized text.


<Figure>
  <Image slot="figure" source={benchmark} altText="LiveOCRBench" />
  <span slot="caption"></span>
</Figure>

## Language Bias in LMMs
Examples illustrating LMMs produced typos in these specific, human-simple cases, including: 
1. Recognition Error: involving character-level misidentifications within words
2. Location Error: word-level omissions or ordering mistakes of critical title words.
<Figure>
  <Image slot="figure" source={bias} altText="Language Bias" />
  <span slot="caption"></span>
</Figure>

## Leaderboard



| Model                | Overall   | Album     | Book      | Game      | Movie     |
|----------------------|-----------|-----------|-----------|-----------|-----------|
| Gemini-2.5-pro       | **80.52%**| **68.31%**| **90.14%**| **85.71%**| **88.33%**|
| Gemini-2.5-flash     | 75.32%    | 66.20%    | 80.28%    | 81.25%    | 80.00%    |
| Qwen2-VL-7B          | 72.21%    | 64.08%    | 83.10%    | 71.43%    | 80.00%    |
| GPT-4o-2411          | 69.09%    | 57.04%    | **90.14%**| 74.11%    | 63.33%    |
| gpt-4o-mini          | 67.53%    | 54.93%    | 83.10%    | 70.54%    | 73.33%    |
| Qwen2.5-VL-72B       | 67.27%    | 53.52%    | 71.83%    | 75.00%    | 80.00%    |
| GPT-4o-2408          | 67.01%    | 48.59%    | 81.69%    | 79.46%    | 70.00%    |
| Qwen2.5-VL-7B        | 65.97%    | 51.41%    | 74.65%    | 72.32%    | 78.33%    |
| Qwen2-VL-2B          | 64.68%    | 52.82%    | 76.06%    | 66.07%    | 76.67%    |
| Qwen2-VL-72B         | 63.64%    | 56.34%    | 66.20%    | 66.07%    | 73.33%    |
| InternVL3-78B        | 63.64%    | 54.23%    | 67.61%    | 68.75%    | 71.67%    |
| Claude-3-7-sonnet    | 62.60%    | 38.03%    | 85.92%    | 66.07%    | 86.67%  |
| Claude-3-5-sonnet    | 59.74%    | 39.44%    | 76.06%    | 62.50%    | 83.33%    |
| Pixtral-Large        | 58.96%    | 41.55%    | 76.06%    | 60.71%    | 76.67%    |
| InternVL2.5-78B      | 50.65%    | 43.66%    | 54.93%    | 53.57%    | 56.67%    |
| InternVL3-8B         | 49.61%    | 40.14%    | 54.93%    | 49.11%    | 66.67%    |
| LLaVA-OV-7b          | 45.71%    | 45.77%    | 50.70%    | 41.07%    | 48.33%    |
| InternVL2.5-8B       | 39.22%    | 30.28%    | 52.11%    | 41.07%    | 41.67%    |
| Phi-3.5-vision       | 37.40%    | 19.01%    | 40.85%    | 53.57%    | 46.67%    |
| Idefics3-8B          | 28.05%    | 29.58%    | 39.44%    | 17.86%    | 30.00%    |
| Phi-4-multimodal     | 17.92%    | 16.20%    | 21.13%    | 17.86%    | 18.33%    |

## BibTeX citation

```bibtex
@misc{liu2025liveocrbench,
  author = "{Yesheng Liu, Zheqi He, Jingshu Zheng, Jinge Yao, Xi Yang, Jiajun Zhang}",
  title = "LiveOCRBench: Tracking Vision-Language Models' Persistent Struggles with Text Recognition",
  year = "2025",
  howpublished = "\url{https://flageval-baai.github.io/liveocrbench/}",
}
```